Using Data to Analyse Elite Runners and their Performance through History

1. Introduction

1.1. Aim

The aim of this project is to investigate how the performance of elite runners is linked to factors like height, weight and age. This will be done by analysing historical data about athlete performances through history, mainly in modern Olympic running events. Understanding the relationship between these characteristics and athlete performance would be extremely helpful in developing training plans and improving athlete performance in the future.

1.2. Specific Questions Addressed

Athlete performance is clearly affected by many factors, and this analysis will be limited to just a few of them, dictated mainly by the data available. The specific questions addressed here are:

  1. How have elite runners' performances changed through history?
  2. How have Olympic runners' characteristics (height, weight, age, body mass index) changed through history?
  3. Is there a relationship between Olympic runners' characteristics and their performance?

There will be no particular modelling or machine learning in this analysis because the questions can be answered by visualising the statistics alone.

In [1]:
# Import libraries
import pandas as pd
import chardet                          # For character encoding
import ftfy                             # For fixing encoding issues
from matplotlib import pyplot as plt
from matplotlib import pylab as plb
from datetime import datetime, time
import numpy as np
from fuzzywuzzy import fuzz             # For inexact ("fuzzy") string matching
from fuzzywuzzy import process
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()        # For future compatibility when plotting with datetime

2. Data Understanding

2.1. Load Data and Initial Analysis

This analysis will attempt some originality by combining three separate data sets. This allows athlete characteristics to be linked to athlete performances so any relationship between the two can be investigated.

First, load the data sets and briefly examine them.

2.1.1. Olympic Games Results and Athletes, 1896-2016

The first data set is the Olympic Games results and athlete data, 1896-2016.

Source:

https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results/downloads/120-years-of-olympic-history-athletes-and-results.zip/2

In [2]:
# Results data is in the first file:
all_olympics = pd.read_csv('datasets/athlete_events.csv')
all_olympics.head()
Out[2]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
0 1 A Dijiang M 24.0 180.0 80.0 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball NaN
1 2 A Lamusi M 23.0 170.0 60.0 China CHN 2012 Summer 2012 Summer London Judo Judo Men's Extra-Lightweight NaN
2 3 Gunnar Nielsen Aaby M 24.0 NaN NaN Denmark DEN 1920 Summer 1920 Summer Antwerpen Football Football Men's Football NaN
3 4 Edgar Lindenau Aabye M 34.0 NaN NaN Denmark/Sweden DEN 1900 Summer 1900 Summer Paris Tug-Of-War Tug-Of-War Men's Tug-Of-War Gold
4 5 Christine Jacoba Aaftink F 21.0 185.0 82.0 Netherlands NED 1988 Winter 1988 Winter Calgary Speed Skating Speed Skating Women's 500 metres NaN
In [3]:
all_olympics.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
ID        271116 non-null int64
Name      271116 non-null object
Sex       271116 non-null object
Age       261642 non-null float64
Height    210945 non-null float64
Weight    208241 non-null float64
Team      271116 non-null object
NOC       271116 non-null object
Games     271116 non-null object
Year      271116 non-null int64
Season    271116 non-null object
City      271116 non-null object
Sport     271116 non-null object
Event     271116 non-null object
Medal     39783 non-null object
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB

Summary

The full Olympic Games data set contains useful information over a 120 year period about the competitiors (height, weight, age, country of origin, and medal, if they won one).

2.1.2. Olympic track and field times and results

The second data set contains the Olympic track and field times and results. Note it only includes data for medal winners. Source:

https://www.kaggle.com/jayrav13/olympic-track-field-results/downloads/olympic-track-field-results.zip/1

In [4]:
# Data set 2 - Olympic track and field times and results. Source:
# https://www.kaggle.com/jayrav13/olympic-track-field-results/downloads/olympic-track-field-results.zip/1
# There is an additional column in a few of the rows. This is unlabelled so not useful in this analysis.
# Therefore, read explicitly labelled columns and disgard the unlabelled column.
ol_tf = pd.read_csv('datasets/results.csv', names=['Gender', 
                                                    'Event', 
                                                    'Location', 
                                                    'Year', 
                                                    'Medal', 
                                                    'Name', 
                                                    'Nationality', 
                                                    'Result'])
ol_tf.drop(index=0, inplace=True)
ol_tf.head()
Out[4]:
Gender Event Location Year Medal Name Nationality Result
1 M 10000M Men Rio 2016 G Mohamed FARAH USA 25:05.17
2 M 10000M Men Rio 2016 S Paul Kipngetich TANUI KEN 27:05.64
3 M 10000M Men Rio 2016 B Tamirat TOLA ETH 27:06.26
4 M 10000M Men Beijing 2008 G Kenenisa BEKELE ETH 27:01.17
5 M 10000M Men Beijing 2008 S Sileshi SIHINE ETH 27:02.77
In [5]:
ol_tf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2394 entries, 1 to 2394
Data columns (total 8 columns):
Gender         2394 non-null object
Event          2394 non-null object
Location       2394 non-null object
Year           2394 non-null object
Medal          2394 non-null object
Name           2164 non-null object
Nationality    2394 non-null object
Result         2394 non-null object
dtypes: object(8)
memory usage: 168.3+ KB

Summary

The most useful feature of the track and field results is the detailed running times and event results. This will be linked to the full Olympic data (including its information on the athletes' characteristics) later in the analysis.

2.1.3. Top 1000 Running Performances for all Olympic Distances and the Half Marathon

The third data set contains the top 1000 running performances for each running event.

Source:

https://www.kaggle.com/jguerreiro/running/downloads/running.zip/2

In [6]:
top_running = pd.read_csv('datasets/data.csv')
top_running.head()
Out[6]:
Rank Time Name Country Date of Birth Place City Date Gender Event
0 1 00:01:40.910000 David Rudisha KEN 1988-12-17 1.0 London 2012-09-08 Men 800 m
1 2 00:01:41.010000 David Rudisha KEN 1988-12-17 1.0 Rieti 2010-08-29 Men 800 m
2 3 00:01:41.090000 David Rudisha KEN 1988-12-17 1.0 Berlin 2010-08-22 Men 800 m
3 4 00:01:41.110000 Wilson Kipketer DEN 1970-12-12 1.0 Köln 1997-08-24 Men 800 m
4 5 00:01:41.240000 Wilson Kipketer DEN 1970-12-12 1.0 Zürich 1997-08-13 Men 800 m
In [7]:
top_running.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18244 entries, 0 to 18243
Data columns (total 10 columns):
Rank             18244 non-null int64
Time             18244 non-null object
Name             18244 non-null object
Country          18244 non-null object
Date of Birth    18244 non-null object
Place            18236 non-null float64
City             18244 non-null object
Date             18244 non-null object
Gender           18244 non-null object
Event            18244 non-null object
dtypes: float64(1), int64(1), object(8)
memory usage: 1.4+ MB

Summary

This data set is good because it contains a large number of data points (1000) including finish times for every running discipline. It is not limited to Olympic performances, but all the events are Olympic distances, with the exception of the half marathon.

3. Data Preparation

3.1. Olympic Results Data

3.1.1. Events

In [8]:
print("Number of unique events is {}".format(len(all_olympics['Event'].unique())))
Number of unique events is 765

765 events is far too many to analyse. It also includes some events which have not taken place in the Olympics for a long time. This analysis is focussed on modern running events, so we will extract a subset of the results.

In [9]:
olympic_sports_groups = all_olympics.groupby('Sport')
athletics = olympic_sports_groups.get_group('Athletics')
all_athletics_events = athletics['Event'].unique()
all_athletics_events
Out[9]:
array(["Athletics Women's 100 metres",
       "Athletics Women's 4 x 100 metres Relay",
       "Athletics Men's Shot Put", "Athletics Men's Pole Vault",
       "Athletics Men's High Jump", "Athletics Men's 1,500 metres",
       "Athletics Men's 4 x 100 metres Relay",
       "Athletics Men's Long Jump", "Athletics Women's Javelin Throw",
       "Athletics Men's 5,000 metres", "Athletics Women's Shot Put",
       "Athletics Men's 110 metres Hurdles", "Athletics Women's Marathon",
       "Athletics Men's 100 metres", "Athletics Men's 400 metres Hurdles",
       "Athletics Men's 400 metres", "Athletics Men's Hammer Throw",
       "Athletics Men's 800 metres", "Athletics Men's Marathon",
       "Athletics Men's 4 x 400 metres Relay",
       "Athletics Men's 10,000 metres", "Athletics Women's 3,000 metres",
       "Athletics Men's 200 metres", "Athletics Men's Javelin Throw",
       "Athletics Men's 3,000 metres Steeplechase",
       "Athletics Women's 200 metres", "Athletics Men's Triple Jump",
       "Athletics Women's Long Jump", "Athletics Women's 5,000 metres",
       "Athletics Men's Discus Throw", "Athletics Women's 10,000 metres",
       "Athletics Men's Decathlon", "Athletics Women's Discus Throw",
       "Athletics Women's 1,500 metres", "Athletics Women's Pole Vault",
       "Athletics Women's 4 x 400 metres Relay",
       "Athletics Women's 800 metres", "Athletics Women's 400 metres",
       "Athletics Men's Javelin Throw, Both Hands",
       "Athletics Women's 400 metres Hurdles",
       "Athletics Women's Pentathlon", "Athletics Women's High Jump",
       "Athletics Men's Standing High Jump",
       "Athletics Men's Standing Long Jump",
       "Athletics Men's 20 kilometres Walk",
       "Athletics Men's 50 kilometres Walk",
       "Athletics Women's 100 metres Hurdles",
       "Athletics Men's Discus Throw, Greek Style",
       "Athletics Women's Heptathlon", "Athletics Men's Stone Throw",
       "Athletics Men's Javelin Throw, Freestyle",
       "Athletics Men's Pentathlon (Ancient)",
       "Athletics Women's 3,000 metres Steeplechase",
       "Athletics Men's Shot Put, Both Hands",
       "Athletics Men's 10 kilometres Walk",
       "Athletics Women's 20 kilometres Walk",
       "Athletics Women's 80 metres Hurdles",
       "Athletics Women's Triple Jump",
       "Athletics Women's 10 kilometres Walk",
       "Athletics Men's Cross-Country, Individual",
       "Athletics Men's Cross-Country, Team",
       "Athletics Men's 3,000 metres, Team", "Athletics Men's Pentathlon",
       "Athletics Men's 3,000 metres Walk", "Athletics Men's 5 mile",
       "Athletics Women's Hammer Throw",
       "Athletics Men's 3,200 metres Steeplechase",
       "Athletics Men's Standing Triple Jump",
       "Athletics Men's 4,000 metres Steeplechase",
       "Athletics Men's 5,000 metres, Team",
       "Athletics Men's 1,600 metres Medley Relay",
       "Athletics Men's 60 metres", "Athletics Men's 1,500 metres Walk",
       "Athletics Men's 3 mile, Team",
       "Athletics Men's 3,500 metres Walk",
       "Athletics Men's 10 mile Walk",
       "Athletics Men's Discus Throw, Both Hands",
       "Athletics Men's 200 metres Hurdles",
       "Athletics Men's 56-pound Weight Throw",
       "Athletics Men's 2,500 metres Steeplechase",
       "Athletics Men's All-Around Championship",
       "Athletics Men's 2,590 metres Steeplechase",
       "Athletics Men's 4 mile, Team"], dtype=object)

This is a more manageable list of events. There are still some events here that don't exist in the modern Games. The next step is to remove any events that didn't take place in the most recent summer Games (2016).

In [10]:
modern_athletics_events = athletics[athletics['Year']==2016]['Event'].unique()
modern_athletics_events
Out[10]:
array(["Athletics Men's 5,000 metres", "Athletics Men's 400 metres",
       "Athletics Men's 10,000 metres", "Athletics Women's 200 metres",
       "Athletics Men's Decathlon", "Athletics Men's Marathon",
       "Athletics Women's Shot Put", "Athletics Women's 400 metres",
       "Athletics Men's Shot Put", "Athletics Women's Marathon",
       "Athletics Men's 100 metres", "Athletics Women's 100 metres",
       "Athletics Women's 4 x 100 metres Relay",
       "Athletics Men's 200 metres",
       "Athletics Men's 4 x 100 metres Relay",
       "Athletics Men's High Jump", "Athletics Men's Triple Jump",
       "Athletics Women's Heptathlon", "Athletics Women's Javelin Throw",
       "Athletics Women's Pole Vault",
       "Athletics Women's 20 kilometres Walk",
       "Athletics Women's 3,000 metres Steeplechase",
       "Athletics Men's 3,000 metres Steeplechase",
       "Athletics Women's 800 metres", "Athletics Women's 1,500 metres",
       "Athletics Men's Discus Throw", "Athletics Men's 1,500 metres",
       "Athletics Men's 400 metres Hurdles",
       "Athletics Women's Long Jump",
       "Athletics Men's 110 metres Hurdles",
       "Athletics Women's 100 metres Hurdles",
       "Athletics Women's 5,000 metres",
       "Athletics Men's 4 x 400 metres Relay",
       "Athletics Men's Long Jump", "Athletics Men's 800 metres",
       "Athletics Women's High Jump", "Athletics Men's Javelin Throw",
       "Athletics Women's 4 x 400 metres Relay",
       "Athletics Women's 400 metres Hurdles",
       "Athletics Women's Discus Throw", "Athletics Men's Hammer Throw",
       "Athletics Women's 10,000 metres",
       "Athletics Men's 50 kilometres Walk",
       "Athletics Men's 20 kilometres Walk", "Athletics Men's Pole Vault",
       "Athletics Women's Triple Jump", "Athletics Women's Hammer Throw"],
      dtype=object)
In [11]:
removed_events = set(all_athletics_events).difference(modern_athletics_events)
removed_events
Out[11]:
{"Athletics Men's 1,500 metres Walk",
 "Athletics Men's 1,600 metres Medley Relay",
 "Athletics Men's 10 kilometres Walk",
 "Athletics Men's 10 mile Walk",
 "Athletics Men's 2,500 metres Steeplechase",
 "Athletics Men's 2,590 metres Steeplechase",
 "Athletics Men's 200 metres Hurdles",
 "Athletics Men's 3 mile, Team",
 "Athletics Men's 3,000 metres Walk",
 "Athletics Men's 3,000 metres, Team",
 "Athletics Men's 3,200 metres Steeplechase",
 "Athletics Men's 3,500 metres Walk",
 "Athletics Men's 4 mile, Team",
 "Athletics Men's 4,000 metres Steeplechase",
 "Athletics Men's 5 mile",
 "Athletics Men's 5,000 metres, Team",
 "Athletics Men's 56-pound Weight Throw",
 "Athletics Men's 60 metres",
 "Athletics Men's All-Around Championship",
 "Athletics Men's Cross-Country, Individual",
 "Athletics Men's Cross-Country, Team",
 "Athletics Men's Discus Throw, Both Hands",
 "Athletics Men's Discus Throw, Greek Style",
 "Athletics Men's Javelin Throw, Both Hands",
 "Athletics Men's Javelin Throw, Freestyle",
 "Athletics Men's Pentathlon",
 "Athletics Men's Pentathlon (Ancient)",
 "Athletics Men's Shot Put, Both Hands",
 "Athletics Men's Standing High Jump",
 "Athletics Men's Standing Long Jump",
 "Athletics Men's Standing Triple Jump",
 "Athletics Men's Stone Throw",
 "Athletics Women's 10 kilometres Walk",
 "Athletics Women's 3,000 metres",
 "Athletics Women's 80 metres Hurdles",
 "Athletics Women's Pentathlon"}
In [12]:
indices_to_remove = [athletics.index[i] for i in range(len(athletics)) if athletics['Event'].iloc[i] in removed_events]
modern_athletics = athletics.drop(index=indices_to_remove)
modern_athletics['Event'].unique()
Out[12]:
array(["Athletics Women's 100 metres",
       "Athletics Women's 4 x 100 metres Relay",
       "Athletics Men's Shot Put", "Athletics Men's Pole Vault",
       "Athletics Men's High Jump", "Athletics Men's 1,500 metres",
       "Athletics Men's 4 x 100 metres Relay",
       "Athletics Men's Long Jump", "Athletics Women's Javelin Throw",
       "Athletics Men's 5,000 metres", "Athletics Women's Shot Put",
       "Athletics Men's 110 metres Hurdles", "Athletics Women's Marathon",
       "Athletics Men's 100 metres", "Athletics Men's 400 metres Hurdles",
       "Athletics Men's 400 metres", "Athletics Men's Hammer Throw",
       "Athletics Men's 800 metres", "Athletics Men's Marathon",
       "Athletics Men's 4 x 400 metres Relay",
       "Athletics Men's 10,000 metres", "Athletics Men's 200 metres",
       "Athletics Men's Javelin Throw",
       "Athletics Men's 3,000 metres Steeplechase",
       "Athletics Women's 200 metres", "Athletics Men's Triple Jump",
       "Athletics Women's Long Jump", "Athletics Women's 5,000 metres",
       "Athletics Men's Discus Throw", "Athletics Women's 10,000 metres",
       "Athletics Men's Decathlon", "Athletics Women's Discus Throw",
       "Athletics Women's 1,500 metres", "Athletics Women's Pole Vault",
       "Athletics Women's 4 x 400 metres Relay",
       "Athletics Women's 800 metres", "Athletics Women's 400 metres",
       "Athletics Women's 400 metres Hurdles",
       "Athletics Women's High Jump",
       "Athletics Men's 20 kilometres Walk",
       "Athletics Men's 50 kilometres Walk",
       "Athletics Women's 100 metres Hurdles",
       "Athletics Women's Heptathlon",
       "Athletics Women's 3,000 metres Steeplechase",
       "Athletics Women's 20 kilometres Walk",
       "Athletics Women's Triple Jump", "Athletics Women's Hammer Throw"],
      dtype=object)

This analysis will focus on individual running events. So, now remove the field events and non-running events.

In [13]:
# These are the events to keep for the analysis.
modern_individual_running_events = {"Athletics Women's 100 metres",
                            "Athletics Men's 1,500 metres",
                            "Athletics Men's 5,000 metres",
                            "Athletics Men's 110 metres Hurdles",
                            "Athletics Women's Marathon",
                            "Athletics Men's 100 metres", 
                            "Athletics Men's 400 metres Hurdles",
                            "Athletics Men's 400 metres", 
                            "Athletics Men's 800 metres",
                            "Athletics Men's Marathon",
                            "Athletics Men's 10,000 metres", 
                            "Athletics Men's 200 metres",
                            "Athletics Men's 3,000 metres Steeplechase",
                            "Athletics Women's 200 metres", 
                            "Athletics Women's 5,000 metres",
                            "Athletics Women's 10,000 metres",
                            "Athletics Women's 1,500 metres",
                            "Athletics Women's 800 metres", 
                            "Athletics Women's 400 metres",
                            "Athletics Women's 400 metres Hurdles",
                            "Athletics Women's 100 metres Hurdles",
                            "Athletics Women's 3,000 metres Steeplechase"}
In [14]:
removed_events = set(modern_athletics_events).difference(modern_individual_running_events)
removed_events
Out[14]:
{"Athletics Men's 20 kilometres Walk",
 "Athletics Men's 4 x 100 metres Relay",
 "Athletics Men's 4 x 400 metres Relay",
 "Athletics Men's 50 kilometres Walk",
 "Athletics Men's Decathlon",
 "Athletics Men's Discus Throw",
 "Athletics Men's Hammer Throw",
 "Athletics Men's High Jump",
 "Athletics Men's Javelin Throw",
 "Athletics Men's Long Jump",
 "Athletics Men's Pole Vault",
 "Athletics Men's Shot Put",
 "Athletics Men's Triple Jump",
 "Athletics Women's 20 kilometres Walk",
 "Athletics Women's 4 x 100 metres Relay",
 "Athletics Women's 4 x 400 metres Relay",
 "Athletics Women's Discus Throw",
 "Athletics Women's Hammer Throw",
 "Athletics Women's Heptathlon",
 "Athletics Women's High Jump",
 "Athletics Women's Javelin Throw",
 "Athletics Women's Long Jump",
 "Athletics Women's Pole Vault",
 "Athletics Women's Shot Put",
 "Athletics Women's Triple Jump"}
In [15]:
indices_to_remove = [modern_athletics.index[i] for i in range(len(modern_athletics)) if modern_athletics['Event'].iloc[i] in removed_events]
ol_running = modern_athletics.drop(index=indices_to_remove)
ol_running.head()
Out[15]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
26 8 Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN Netherlands NED 1932 Summer 1932 Summer Los Angeles Athletics Athletics Women's 100 metres NaN
98 34 Jamale (Djamel-) Aarrass (Ahrass-) M 30.0 187.0 76.0 France FRA 2012 Summer 2012 Summer London Athletics Athletics Men's 1,500 metres NaN
148 55 Antonio Abadia Beci M 26.0 170.0 65.0 Spain ESP 2016 Summer 2016 Summer Rio de Janeiro Athletics Athletics Men's 5,000 metres NaN
190 86 Jos Manuel Abascal Gmez M 22.0 182.0 67.0 Spain ESP 1980 Summer 1980 Summer Moskva Athletics Athletics Men's 1,500 metres NaN
191 86 Jos Manuel Abascal Gmez M 26.0 182.0 67.0 Spain ESP 1984 Summer 1984 Summer Los Angeles Athletics Athletics Men's 1,500 metres Bronze

3.1.2. Missing Values

In [16]:
# Check for missing values in each column.
ol_running.isnull().sum()
Out[16]:
ID            0
Name          0
Sex           0
Age         667
Height     2987
Weight     3131
Team          0
NOC           0
Games         0
Year          0
Season        0
City          0
Sport         0
Event         0
Medal     17951
dtype: int64

Many rows have no entry for a medal, and this is expected - many competitors do not win a medal, so there is no special treatment needed for missing values in the Medal feature. There are also a lot of missing values for height, weight and age, these will be examined now.

In [17]:
age_missing = ol_running[ol_running['Age'].isnull()]
weight_missing = ol_running[ol_running['Weight'].isnull()]
height_missing = ol_running[ol_running['Height'].isnull()]
In [18]:
age_missing.head()
Out[18]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
314 167 Ould Lamine Abdallah M NaN NaN NaN France FRA 1952 Summer 1952 Summer Helsinki Athletics Athletics Men's 10,000 metres NaN
327 179 Ibrahim Saad Abdel Galil M NaN 176.0 73.0 Sudan SUD 1972 Summer 1972 Summer Munich Athletics Athletics Men's 200 metres NaN
353 194 Moustafa Mounib Abdel Kader M NaN 176.0 67.0 United Arab Republic UAR 1960 Summer 1960 Summer Roma Athletics Athletics Men's 100 metres NaN
505 281 S. Abdul Hamid M NaN NaN NaN India IND 1928 Summer 1928 Summer Amsterdam Athletics Athletics Men's 110 metres Hurdles NaN
506 281 S. Abdul Hamid M NaN NaN NaN India IND 1928 Summer 1928 Summer Amsterdam Athletics Athletics Men's 400 metres Hurdles NaN
In [19]:
weight_missing.head()
Out[19]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
26 8 Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN Netherlands NED 1932 Summer 1932 Summer Los Angeles Athletics Athletics Women's 100 metres NaN
215 104 Gana Abba Kimet M 26.0 NaN NaN Chad CHA 1972 Summer 1972 Summer Munich Athletics Athletics Men's 100 metres NaN
314 167 Ould Lamine Abdallah M NaN NaN NaN France FRA 1952 Summer 1952 Summer Helsinki Athletics Athletics Men's 10,000 metres NaN
322 175 Abdelgani Hassan Abdel Fattah M 31.0 NaN NaN Egypt EGY 1952 Summer 1952 Summer Helsinki Athletics Athletics Men's Marathon NaN
429 237 Ben Ahmed Abdelkrim M 20.0 NaN NaN France FRA 1952 Summer 1952 Summer Helsinki Athletics Athletics Men's 5,000 metres NaN
In [20]:
height_missing.head()
Out[20]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
215 104 Gana Abba Kimet M 26.0 NaN NaN Chad CHA 1972 Summer 1972 Summer Munich Athletics Athletics Men's 100 metres NaN
314 167 Ould Lamine Abdallah M NaN NaN NaN France FRA 1952 Summer 1952 Summer Helsinki Athletics Athletics Men's 10,000 metres NaN
322 175 Abdelgani Hassan Abdel Fattah M 31.0 NaN NaN Egypt EGY 1952 Summer 1952 Summer Helsinki Athletics Athletics Men's Marathon NaN
429 237 Ben Ahmed Abdelkrim M 20.0 NaN NaN France FRA 1952 Summer 1952 Summer Helsinki Athletics Athletics Men's 5,000 metres NaN
488 268 Brahim Daoud Abdoulaye M 25.0 NaN 71.0 Chad CHA 1996 Summer 1996 Summer Atlanta Athletics Athletics Men's 200 metres NaN

We now have three groups of rows that have at least one missing value. Now find out if they overlap by using sets:

In [21]:
age_missing_indices = set(age_missing.index)
weight_missing_indices = set(weight_missing.index)
height_missing_indices= set(height_missing.index)
print("The number of rows where both height and weight are missing is {}".format(
                        len(weight_missing_indices.intersection(height_missing_indices))))
print("The number of rows where both age and weight are missing is {}".format(
                        len(age_missing_indices.intersection(weight_missing_indices))))
print("The number of rows where both age and height are missing is {}".format(
                        len(age_missing_indices.intersection(height_missing_indices))))
print("The number of rows where age, height and weight are missing is {}".format(
                        len(age_missing_indices.intersection(height_missing_indices, weight_missing_indices))))
The number of rows where both height and weight are missing is 2961
The number of rows where both age and weight are missing is 537
The number of rows where both age and height are missing is 504
The number of rows where age, height and weight are missing is 504

Of the rows where either height (2987) or weight (3131) are missing, most (2961) of them are missing both height and weight. Of the rows where age is missing (667), most (at least 504) are also missing either weight, height or both. The overlap between the missing data sets is large, which is good news, because it means more of the rows are fully populated, so more of this data is usable without dropping data or imputation. For now, all the data will be kept (not dropping rows with missing data).

3.1.3. Event Labels

The Event feature is a categorical variable. This will be encoded as follows:

  • Strip off the part of the string related to sport (== 'Athletics') - this is not useful anymore.
  • Strip off the gender ("Men's" or "Women's" - because this is already encoded in a separate feature
  • Define a feature 'Track_Flat'. This will cover all the flat (i.e. non-hurdles) track events. It will contain the integer distance of the race in metres, i.e. the 100 m will have a 'Track_Flat' value of 100. Events that are not flat track type will have a 'Track_Flat' value of 0.
  • Define a feature 'Hurdles'. This will be similar to the 'Track_Flat' feature, and will contain the integer distance of the race, or 0 for non-hurdles events.
  • Define a feature 'Road'. This will be the race distance in metrs (42195 m for a marathon, 21098 for a half marathon)
  • Define a feature 'Steeplechase'. This will be 3000 for steeplechase, 0 otherwise.

This method of encoding is chosen because it groups together similar types of events (e.g., hurdles events are treated as a group, flat track events are treated as a separate group) and also separates them by the distance of each event (100m, 200m, etc.)

In [22]:
# Simple string processing in Event column
ol_running['Event'] = ol_running['Event'].str.replace("Athletics Women's ", "")
ol_running['Event'] = ol_running['Event'].str.replace("Athletics Men's ", "")
ol_running['Event'] = ol_running['Event'].str.replace(" metres", "")
ol_running['Event'] = ol_running['Event'].str.replace(",", "")
In [23]:
ol_running.head()
Out[23]:
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
26 8 Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN Netherlands NED 1932 Summer 1932 Summer Los Angeles Athletics 100 NaN
98 34 Jamale (Djamel-) Aarrass (Ahrass-) M 30.0 187.0 76.0 France FRA 2012 Summer 2012 Summer London Athletics 1500 NaN
148 55 Antonio Abadia Beci M 26.0 170.0 65.0 Spain ESP 2016 Summer 2016 Summer Rio de Janeiro Athletics 5000 NaN
190 86 Jos Manuel Abascal Gmez M 22.0 182.0 67.0 Spain ESP 1980 Summer 1980 Summer Moskva Athletics 1500 NaN
191 86 Jos Manuel Abascal Gmez M 26.0 182.0 67.0 Spain ESP 1984 Summer 1984 Summer Los Angeles Athletics 1500 Bronze

Adding the new columns, copying the values between columns and removing duplicates is repetetive so write a function for this:

In [24]:
def encode_events(df, col, to_replace, replacement):
    """
    Helper function to insert new columns, copy and convert values to the correct column
    """
    # Insert new column
    df.insert(df.columns.get_loc('Event'), col, 0)
    # Copy values across to new column
    df.loc[df['Event'].str.contains(to_replace), col] = df['Event'].str.replace(to_replace, replacement)
    # Remove values from original column
    df.loc[df[col] != 0, 'Event'] = '0'
In [25]:
def string_to_int(df, features):
    """
    Helper function to cast string values to integers.
    """
    for feature in features:
        df[feature] = pd.to_numeric(df[feature], downcast='integer')
In [26]:
new_columns = ['Hurdles', 'Road', 'Steeplechase']
to_replace = [' Hurdles', 'Marathon', ' Steeplechase']
replacement = ['', '42195', '']

for i in range(len(new_columns)):
    encode_events(ol_running, new_columns[i], to_replace[i], replacement[i])
    
ol_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)

3.1.4. Other Processing

Cast Strings to Integers
In [27]:
# Several features now contain strings that would be easier to use as integers.
# Convert these to integers now.
columns_to_int = ['Track_Flat', 'Hurdles', 'Steeplechase', 'Road', 'Year']
string_to_int(ol_running, columns_to_int)
Sex

The other two data sets refer to this as 'Gender'. For ease of comparison, change the name of this feature from 'Sex' to ''Gender'.

In [28]:
ol_running.rename(columns={'Sex': 'Gender'}, inplace=True)
Medal

For ease of comparison with the other data sets, convert 'Gold' to 'G', 'Silver' to 'S', and 'Bronze to 'B'

In [29]:
medals = ['Gold', 'Silver', 'Bronze']
short_medals = ['G', 'S', 'B']

for i in range(len(medals)):
    ol_running.loc[ol_running['Medal'] == medals[i], 'Medal'] = ol_running[ol_running['Medal'] == medals[i]
                                                                       ]['Medal'].str.replace(medals[i], short_medals[i])
Name

The 'Name' column also looks difficult to use:

In [30]:
ol_running['Name'].head()
Out[30]:
26     Cornelia "Cor" Aalten (-Strannood)
98     Jamale (Djamel-) Aarrass (Ahrass-)
148                   Antonio Abadia Beci
190               Jos Manuel Abascal Gmez
191               Jos Manuel Abascal Gmez
Name: Name, dtype: object

There are alternative names/nicknames in parentheses and double quotes. The intention is to use the names later on, so to make this easier, remove sections in parentheses and double quotes, and convert the name string to lowercase. Make this a function so it can be used on the other data sets later on.

In [31]:
def process_names(df):
    """
    Helper function to perform some cleaning on the athlete Name field.
    """
    df.rename(columns={'Name': 'RawName'}, inplace=True)
    df.insert(loc = df.columns.get_loc('RawName'), column = 'Name', value=np.NaN)
    df['Name'] = df['RawName'].str.replace('\"(.*?)\"', '')
    df['Name'] = df['Name'].str.replace('\((.*?)\)', '')
    df['Name'] = df['Name'].str.lower()
In [32]:
process_names(ol_running)
In [33]:
ol_running.head()
Out[33]:
ID Name RawName Gender Age Height Weight Team NOC Games Year Season City Sport Hurdles Road Steeplechase Track_Flat Medal
26 8 cornelia aalten Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN Netherlands NED 1932 Summer 1932 Summer Los Angeles Athletics 0 0 0 100 NaN
98 34 jamale aarrass Jamale (Djamel-) Aarrass (Ahrass-) M 30.0 187.0 76.0 France FRA 2012 Summer 2012 Summer London Athletics 0 0 0 1500 NaN
148 55 antonio abadia beci Antonio Abadia Beci M 26.0 170.0 65.0 Spain ESP 2016 Summer 2016 Summer Rio de Janeiro Athletics 0 0 0 5000 NaN
190 86 jos manuel abascal gmez Jos Manuel Abascal Gmez M 22.0 182.0 67.0 Spain ESP 1980 Summer 1980 Summer Moskva Athletics 0 0 0 1500 NaN
191 86 jos manuel abascal gmez Jos Manuel Abascal Gmez M 26.0 182.0 67.0 Spain ESP 1984 Summer 1984 Summer Los Angeles Athletics 0 0 0 1500 B

Wrangling of this data set is complete, and from here on the cleaned data frame will always be called ol_running.

3.2. Olympic Track and Field Data

3.2.1. Events

In [34]:
print("Number of unique events is {}".format(len(ol_tf['Event'].unique())))
Number of unique events is 47
In [35]:
ol_tf.head()
Out[35]:
Gender Event Location Year Medal Name Nationality Result
1 M 10000M Men Rio 2016 G Mohamed FARAH USA 25:05.17
2 M 10000M Men Rio 2016 S Paul Kipngetich TANUI KEN 27:05.64
3 M 10000M Men Rio 2016 B Tamirat TOLA ETH 27:06.26
4 M 10000M Men Beijing 2008 G Kenenisa BEKELE ETH 27:01.17
5 M 10000M Men Beijing 2008 S Sileshi SIHINE ETH 27:02.77
In [36]:
all_ol_tf_events = ol_tf['Event'].unique()
all_ol_tf_events
Out[36]:
array(['10000M Men', '100M Men', '110M Hurdles Men', '1500M Men',
       '200M Men', '20Km Race Walk Men', '3000M Steeplechase Men',
       '400M Hurdles Men', '400M Men', '4X100M Relay Men',
       '4X400M Relay Men', '5000M Men', '50Km Race Walk Men', '800M Men',
       'Decathlon Men', 'Discus Throw Men', 'Hammer Throw Men',
       'High Jump Men', 'Javelin Throw Men', 'Long Jump Men',
       'Marathon Men', 'Pole Vault Men', 'Shot Put Men',
       'Triple Jump Men', '10000M Women', '100M Hurdles Women',
       '100M Women', '1500M Women', '200M Women', '20Km Race Walk Women',
       '3000M Steeplechase Women', '400M Hurdles Women', '400M Women',
       '4X100M Relay Women', '4X400M Relay Women', '5000M Women',
       '800M Women', 'Discus Throw Women', 'Hammer Throw Women',
       'Heptathlon Women', 'High Jump Women', 'Javelin Throw Women',
       'Long Jump Women', 'Marathon Women', 'Pole Vault Women',
       'Shot Put Women', 'Triple Jump Women'], dtype=object)

As in the previous section, this analysis will keep the individual running events and drop the remainder.

In [37]:
ol_tf_running_events = {'10000M Men', '100M Men', '110M Hurdles Men', '1500M Men',
       '200M Men', '3000M Steeplechase Men',
       '400M Hurdles Men', '400M Men', '5000M Men',
       '800M Men', 'Marathon Men', '10000M Women', '100M Hurdles Women',
       '100M Women', '1500M Women', '200M Women',
       '3000M Steeplechase Women', '400M Hurdles Women', '400M Women',
       '5000M Women', '800M Women', 'Marathon Women'}
In [38]:
indices_to_remove = [ol_tf.index[i] for i in range(len(ol_tf)) 
                    if not ol_tf['Event'].iloc[i] in ol_tf_running_events]
ol_tf_running = ol_tf.drop(index=indices_to_remove)
ol_tf_running.head()
Out[38]:
Gender Event Location Year Medal Name Nationality Result
1 M 10000M Men Rio 2016 G Mohamed FARAH USA 25:05.17
2 M 10000M Men Rio 2016 S Paul Kipngetich TANUI KEN 27:05.64
3 M 10000M Men Rio 2016 B Tamirat TOLA ETH 27:06.26
4 M 10000M Men Beijing 2008 G Kenenisa BEKELE ETH 27:01.17
5 M 10000M Men Beijing 2008 S Sileshi SIHINE ETH 27:02.77
In [39]:
ol_tf_running['Event'].unique()
Out[39]:
array(['10000M Men', '100M Men', '110M Hurdles Men', '1500M Men',
       '200M Men', '3000M Steeplechase Men', '400M Hurdles Men',
       '400M Men', '5000M Men', '800M Men', 'Marathon Men',
       '10000M Women', '100M Hurdles Women', '100M Women', '1500M Women',
       '200M Women', '3000M Steeplechase Women', '400M Hurdles Women',
       '400M Women', '5000M Women', '800M Women', 'Marathon Women'],
      dtype=object)

This now contains the data of interest.

3.2.2. Missing Values

In [40]:
# Check for missing values in each column.
ol_tf_running.isnull().sum()
Out[40]:
Gender         0
Event          0
Location       0
Year           0
Medal          0
Name           0
Nationality    0
Result         0
dtype: int64

No missing values are shown but this is deceptive, since some of the 'Result' fields conatin the string 'None'.

In [41]:
ol_tf_running[ol_tf_running['Result'] == 'None'].head()
Out[41]:
Gender Event Location Year Medal Name Nationality Result
107 M 100M Men London 1908 S John RECTOR USA None
108 M 100M Men London 1908 B Robert KERR CAN None
190 M 110M Hurdles Men London 1908 B Arthur SHAW USA None
300 M 1500M Men Los Angeles 1932 B Philip EDWARDS CAN None
309 M 1500M Men St Louis 1904 B Lacey HEARN USA None
In [42]:
ol_tf_running.loc[ol_tf_running['Result'] == 'None', 'Result'] = pd.NaT
ol_tf_running.dropna(subset=['Result'], inplace=True)

3.2.3. Event Labels

The same approach will be used as in the previous section so that the data sets end up with a consistent set of labels for each event.

In [43]:
# Simple string processing in Event column
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("Women", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("Men", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace("M ", "")
ol_tf_running['Event'] = ol_tf_running['Event'].str.replace(",", "")

new_columns = ['Hurdles', 'Road', 'Steeplechase']
to_replace = ['Hurdles', 'Marathon', 'Steeplechase']
replacement = ['', '42195', '']

for i in range(len(new_columns)):
    encode_events(ol_tf_running, new_columns[i], to_replace[i], replacement[i])
    
ol_tf_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)

3.2.4. Results

The aim is to convert the Results string to a datetime object, extract the time from this and store it in a feature called 'Time'. The time formats vary a lot in this data set so some cleaning is needed.

It's possible to create general groups of events that share similar formats.

In [44]:
# Hurdle events
ol_tf_running_hurdles_groups = ol_tf_running.groupby('Hurdles')
# Road running events
ol_tf_running_road_groups = ol_tf_running.groupby('Road')
# Steeplechase
ol_tf_running_steeplechase_groups = ol_tf_running.groupby('Steeplechase')
# Track (flat) events
ol_tf_running_trackf_groups = ol_tf_running.groupby('Track_Flat')

event_groups = [ol_tf_running_hurdles_groups, 
          ol_tf_running_road_groups, 
          ol_tf_running_steeplechase_groups, 
          ol_tf_running_trackf_groups]

for group in event_groups:
    for event in list(group.groups.keys())[1:]: # Ignore the first event in each category where distance=0
        print("Event: {}".format(event))
        print(group.get_group(event)['Result'].head(3))
Event: 100 
1654    12.48
1655    12.59
1656    12.61
Name: Result, dtype: object
Event: 110 
152    13.05
153    13.17
154    13.24
Name: Result, dtype: object
Event: 400 
511    47.73
512    47.78
513    47.92
Name: Result, dtype: object
Event: 42195 
1376    02:08:44
1377     2:09:54
1378     2:10:05
Name: Result, dtype: object
Event: 3000 
439    8:03.28
440    8:04.28
441    8:11.52
Name: Result, dtype: object
Event: 100
70    9.81
71    9.89
72    9.91
Name: Result, dtype: object
Event: 10000
1    25:05.17
2    27:05.64
3    27:06.26
Name: Result, dtype: object
Event: 1500
232    3:50.00
233    3:50.11
234    3:50.24
Name: Result, dtype: object
Event: 200
313    19.78
314    20.02
315    20.12
Name: Result, dtype: object
Event: 400
586    43.03
587    43.76
588    43.85
Name: Result, dtype: object
Event: 5000
804    13:03.30
805    13:03.90
806    13:04.35
Name: Result, dtype: object
Event: 800
930    1:42.15
931    1:42.61
932    1:42.93
Name: Result, dtype: object

This shows it's possible to define three time formats in this result set:

In [45]:
# Time format for the sprint events
time_format_sprints = '%S.%f'

# Time format for middle distance events
time_format_middle = '%M:%S.%f'

# Time format for long distance events
time_format_long = '%H:%M:%S'

Examining each event in more detail shows that some further processing is needed.

Steeplechase

In [46]:
ol_tf_running_steeplechase_groups.get_group('3000 ')['Result'].head()
Out[46]:
439    8:03.28
440    8:04.28
441    8:11.52
442    8:10.34
443    8:10.49
Name: Result, dtype: object
In [47]:
# Convert to datetime and extract the time part only.
ol_tf_running.loc[ol_tf_running['Steeplechase'] == '3000 ', 'Time'] = pd.to_datetime(
    ol_tf_running[ol_tf_running['Steeplechase'] == '3000 ']['Result'], format=time_format_middle).apply(datetime.time)

Hurdles

In [48]:
ol_tf_running_hurdles_groups.get_group('100 ')['Result'].head()
Out[48]:
1654    12.48
1655    12.59
1656    12.61
1657    12.54
1658    12.64
Name: Result, dtype: object
In [49]:
ol_tf_running_hurdles_groups.get_group('110 ')['Result'].head()
Out[49]:
152    13.05
153    13.17
154    13.24
155    12.93
156    13.17
Name: Result, dtype: object
In [50]:
ol_tf_running_hurdles_groups.get_group('400 ')['Result'].head()
Out[50]:
511    47.73
512    47.78
513    47.92
514    47.25
515    47.98
Name: Result, dtype: object

In addition, some of the time strings have a leading '0:':

In [51]:
ol_tf_running[ol_tf_running['Result'] == '0:54.0']
Out[51]:
Gender Hurdles Road Steeplechase Track_Flat Location Year Medal Name Nationality Result Time
544 M 400 0 0 0 Antwerp 1920 G Frank LOOMIS USA 0:54.0 NaN
In [52]:
# Remove leading '0:':
ol_tf_running.loc[ol_tf_running['Hurdles'] == '400 ', 'Result'] = ol_tf_running[ol_tf_running['Hurdles'] == '400 ']['Result'].str.replace('0:', '')
In [53]:
# For all the Hurdles distances - convert to datetime and extract the time part only.
events = list(ol_tf_running_hurdles_groups.groups.keys())
events.remove(0) # Ignore the fist event in each category where distance=0
for event in events:
    ol_tf_running.loc[ol_tf_running['Hurdles'] == event, 'Time'] = pd.to_datetime(
        ol_tf_running[ol_tf_running['Hurdles'] == event]['Result'], format=time_format_sprints).apply(datetime.time)

Track (Flat)

In [54]:
ol_tf_running_trackf_groups.get_group('100')['Result'].head()
Out[54]:
70    9.81
71    9.89
72    9.91
73    9.69
74    9.89
Name: Result, dtype: object
In [55]:
ol_tf_running_trackf_groups.get_group('200')['Result'].head()
Out[55]:
313    19.78
314    20.02
315    20.12
316    19.30
317    19.96
Name: Result, dtype: object
In [56]:
ol_tf_running_trackf_groups.get_group('400')['Result'].head()
Out[56]:
586    43.03
587    43.76
588    43.85
589    43.75
590    44.74
Name: Result, dtype: object
In [57]:
ol_tf_running_trackf_groups.get_group('800')['Result'].head()
Out[57]:
930    1:42.15
931    1:42.61
932    1:42.93
933    1:44.65
934    1:44.70
Name: Result, dtype: object
In [58]:
ol_tf_running_trackf_groups.get_group('1500')['Result'].head()
Out[58]:
232    3:50.00
233    3:50.11
234    3:50.24
235    3:33.11
236    3:34.16
Name: Result, dtype: object
In [59]:
ol_tf_running_trackf_groups.get_group('5000')['Result'].head()
Out[59]:
804    13:03.30
805    13:03.90
806    13:04.35
807    12:57.82
808    13:02.80
Name: Result, dtype: object
In [60]:
ol_tf_running_trackf_groups.get_group('10000')['Result'].head()
Out[60]:
1    25:05.17
2    27:05.64
3    27:06.26
4    27:01.17
5    27:02.77
Name: Result, dtype: object

Track events for distances less than 800m all have times written in the format defined in time_format_sprints. 800m and above use the format defined in time_format_middle.

In [61]:
sprint_distances = ['100', '200', '400']
middle_distances = ['800', '1500', '5000', '10000']

As with the hurdles distances above, remove any leading '0:':

In [62]:
# Remove leading '0:':
for event in sprint_distances:
    ol_tf_running.loc[ol_tf_running['Track_Flat'] == event, 'Result'] = ol_tf_running[ol_tf_running['Track_Flat'] == event]['Result'].str.replace('0:', '')
In [63]:
# For the track sprint events - convert to datetime and extract the time part only.
for event in sprint_distances:
    ol_tf_running.loc[ol_tf_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
        ol_tf_running[ol_tf_running['Track_Flat'] == event]['Result'], format=time_format_sprints).apply(datetime.time)
In [64]:
# For the track middle distance events - convert to datetime and extract the time part only.
for event in middle_distances:
    ol_tf_running.loc[ol_tf_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
        ol_tf_running[ol_tf_running['Track_Flat'] == event]['Result'], format=time_format_middle).apply(datetime.time)

Road

In [65]:
ol_tf_running_road_groups.get_group('42195 ').head()
Out[65]:
Gender Hurdles Road Steeplechase Track_Flat Location Year Medal Name Nationality Result Time
1376 M 0 42195 0 0 Rio 2016 G Eliud Kipchoge ROTICH KEN 02:08:44 NaN
1377 M 0 42195 0 0 Rio 2016 S Feyisa LILESA ETH 2:09:54 NaN
1378 M 0 42195 0 0 Rio 2016 B Galen RUPP USA 2:10:05 NaN
1379 M 0 42195 0 0 Beijing 2008 G Samuel Kamau WANJIRU KEN 2h06:32 NaN
1380 M 0 42195 0 0 Beijing 2008 S Jaouad GHARIB MAR 2h07:16 NaN

Some specific examples show there are several problems:

In [66]:
ol_tf_running_road_groups.get_group('42195 ').loc[[1379]]
Out[66]:
Gender Hurdles Road Steeplechase Track_Flat Location Year Medal Name Nationality Result Time
1379 M 0 42195 0 0 Beijing 2008 G Samuel Kamau WANJIRU KEN 2h06:32 NaN
In [67]:
ol_tf_running_road_groups.get_group('42195 ').loc[[1392]]
Out[67]:
Gender Hurdles Road Steeplechase Track_Flat Location Year Medal Name Nationality Result Time
1392 M 0 42195 0 0 Montreal 1976 S Frank Charles SHORTER USA 2:10:45.8 NaN
In [68]:
ol_tf_running_road_groups.get_group('42195 ').loc[[1417]]
Out[68]:
Gender Hurdles Road Steeplechase Track_Flat Location Year Medal Name Nationality Result Time
1417 M 0 42195 0 0 Paris 1900 B Ernst FAST SWE 3-37:14.0 NaN

This shows several formatting problems:

  • some of the values for marathon times include an 'h' for hours instead of a ':' Remove these and replace with a colon:
  • sometimes there is a milliseconds value, sometimes not
  • sometimes '-' is used instead of ':'

Taking these in turn:

In [69]:
# Remove 'h'
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'].str.replace('h', ':')
In [70]:
# Remove milliseconds:
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'].str.replace('\..*', '')
In [71]:
# Replace '-' with ':'
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Result'] = ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'].str.replace('-', ':')
In [72]:
ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'].head()
Out[72]:
1376    02:08:44
1377     2:09:54
1378     2:10:05
1379     2:06:32
1380     2:07:16
Name: Result, dtype: object

There are also some values that only include hours and minutes:

In [73]:
ol_tf_running[ol_tf_running['Result'] == '2:32']
Out[73]:
Gender Hurdles Road Steeplechase Track_Flat Location Year Medal Name Nationality Result Time
1406 M 0 42195 0 0 Amsterdam 1928 G Boughèra EL OUAFI FRA 2:32 NaN
In [74]:
for i in ol_tf_running[ol_tf_running['Road'] == '42195 '].index:
    if len(ol_tf_running['Result'].loc[i].split(':')) < 3:
        ol_tf_running['Result'].loc[i] = ol_tf_running['Result'].loc[i] + ':00'
In [75]:
# Convert to datetime and extract the time part only.
ol_tf_running.loc[ol_tf_running['Road'] == '42195 ', 'Time'] = pd.to_datetime(
    ol_tf_running[ol_tf_running['Road'] == '42195 ']['Result'], format=time_format_long).apply(datetime.time)

3.2.5. Other Processing

Cast Strings to Integers
In [76]:
# Several features now contain strings that would be easier to use as integers.
# Convert these to integers now.
columns_to_int = ['Track_Flat', 'Hurdles', 'Steeplechase', 'Road', 'Year']
string_to_int(ol_tf_running, columns_to_int)
Name

Some names include nicknames in double quotes. There is also a string encoding problem causing some characters to be displayed wrongly. For example, 'Emil ZÁTOPEK' and 'Katrin DÖRRE' below:

In [77]:
ol_tf_running['Name'].loc[25]
Out[77]:
'Emil ZÃ\x81TOPEK'
In [78]:
ol_tf_running['Name'].loc[2322]
Out[78]:
'Katrin DÃ\x96RRE'
In [79]:
# Check encoding of the file
with open("datasets/results.csv", 'rb') as file:
    print(chardet.detect(file.read()))
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

So chardet still suggests the file is utf-8 encoded. So we can try to clean this up by using the ftfy package to fix the bad encodings (Reference for ftfy: https://ftfy.readthedocs.io/en/latest/)

In [80]:
ol_tf_running['Name'] = ol_tf_running['Name'].apply(ftfy.fix_encoding)
In [81]:
ol_tf_running['Name'].loc[25]
Out[81]:
'Emil ZÁTOPEK'
In [82]:
ol_tf_running['Name'].loc[2322]
Out[82]:
'Katrin DÖRRE'

This shows the bad encodings have disappeared:

  • 'Emil ZÁTOPEK' -> 'Emil ZÁTOPEK'
  • 'Katrin DÖRRE' -> 'Katrin DÖRRE'

Other name text processing is the same as the previous section

In [83]:
# Use the processing function defined previously
process_names(ol_tf_running)
In [84]:
ol_tf_running['Name'].head()
Out[84]:
1            mohamed farah
2    paul kipngetich tanui
3             tamirat tola
4          kenenisa bekele
5           sileshi sihine
Name: Name, dtype: object
Gender

Female athletes are categorised as 'W' in the 'Gender' column. Change this to be 'F' for consistency with the other data sets.

In [85]:
ol_tf_running['Gender'] = ol_tf_running['Gender'].str.replace('W', 'F')

This concludes cleaning of the second data set, which will be named ol_tf_running from here on.

3.3. Top 1000 Running Times

3.3.1. Events

Select individual running events as with the previous two data sets.

In [86]:
print("Number of unique events is {}".format(len(top_running['Event'].unique())))
top_running['Event'].unique()
Number of unique events is 9
Out[86]:
array(['800 m', '400 m', '10,000 m', 'Marathon', '100 m', 'Half marathon',
       '200 m', '1500 m', '5000 m'], dtype=object)

These are all valid events for this analysis. No need to remove any.

3.3.2. Missing Values

In [87]:
top_running.isnull().sum()
Out[87]:
Rank             0
Time             0
Name             0
Country          0
Date of Birth    0
Place            8
City             0
Date             0
Gender           0
Event            0
dtype: int64

There are a few missing 'Place' values. This anlysis will not use this feature and it will not be included any further analysis anyway. No further action on this for now.

In [88]:
top_running.head()
Out[88]:
Rank Time Name Country Date of Birth Place City Date Gender Event
0 1 00:01:40.910000 David Rudisha KEN 1988-12-17 1.0 London 2012-09-08 Men 800 m
1 2 00:01:41.010000 David Rudisha KEN 1988-12-17 1.0 Rieti 2010-08-29 Men 800 m
2 3 00:01:41.090000 David Rudisha KEN 1988-12-17 1.0 Berlin 2010-08-22 Men 800 m
3 4 00:01:41.110000 Wilson Kipketer DEN 1970-12-12 1.0 Köln 1997-08-24 Men 800 m
4 5 00:01:41.240000 Wilson Kipketer DEN 1970-12-12 1.0 Zürich 1997-08-13 Men 800 m

3.3.3. Event Labels

The same approach will be used as in the previous section so that the data sets end up with a consistent set of labels for each event.

In [89]:
# Simple string processing in Event column

# Replace the race type strings ('Marathon', 'Half marathon') with their distance in metres:
racetype = ['Marathon', 'Half marathon']
distance = ['42195 Road', '21098 Road']

for i in range(len(racetype)):
    top_running.loc[top_running['Event'] == racetype[i], 'Event'] = top_running[
        top_running['Event'] == racetype[i]]['Event'].str.replace(racetype[i], distance[i])                                                     
                                                                                
top_running['Event'] = top_running['Event'].str.replace(",", "")
new_columns = ['Road']
to_replace = [' Road']
replacement = ['']

for i in range(len(new_columns)):
    encode_events(top_running, new_columns[i], to_replace[i], replacement[i])

top_running['Event'] = top_running['Event'].str.replace(" m", "")

top_running.rename(columns={'Event': 'Track_Flat'}, inplace=True)
In [90]:
top_running.head()
Out[90]:
Rank Time Name Country Date of Birth Place City Date Gender Road Track_Flat
0 1 00:01:40.910000 David Rudisha KEN 1988-12-17 1.0 London 2012-09-08 Men 0 800
1 2 00:01:41.010000 David Rudisha KEN 1988-12-17 1.0 Rieti 2010-08-29 Men 0 800
2 3 00:01:41.090000 David Rudisha KEN 1988-12-17 1.0 Berlin 2010-08-22 Men 0 800
3 4 00:01:41.110000 Wilson Kipketer DEN 1970-12-12 1.0 Köln 1997-08-24 Men 0 800
4 5 00:01:41.240000 Wilson Kipketer DEN 1970-12-12 1.0 Zürich 1997-08-13 Men 0 800

3.3.4. Other Processing

Date
In [91]:
# In the 'Date' column, the year will be used as one of the keys to merge the data sets.
# Therefore, create a separate 'Year' column and populate it.
top_running.insert(top_running.columns.get_loc('Date'), 'Year', 0)
top_running['Year'] = top_running['Date'].str.split("-", expand=True)[0]
#top_running.rename(columns={'Date': 'Year'}, inplace=True)
Cast Strings to Integers
In [92]:
# Several features now contain strings that would be easier to use as integers.
# Convert these to integers now.
columns_to_int = ['Track_Flat', 'Road', 'Year']
string_to_int(top_running, columns_to_int)
Time

Convert the string into a datetime object. First look at what the different time formats used in each event are:

In [93]:
# Road running events
top_running_road_groups = top_running.groupby('Road')
# Track (flat) events
top_running_trackf_groups = top_running.groupby('Track_Flat')

event_groups = [top_running_road_groups,
                top_running_trackf_groups]

for group in event_groups:
    for event in list(group.groups.keys())[1:]: # Ignore the first event in each category where distance=0
        print("Event: {}".format(event))
        print(group.get_group(event)['Time'].head(3))
Event: 21098
8182    00:58:23
8183    00:58:31
8184    00:58:33
Name: Time, dtype: object
Event: 42195
3001    02:02:57
3002    02:03:02
3003    02:03:03
Name: Time, dtype: object
Event: 100
4000    00:00:10.490000
4001    00:00:10.610000
4002    00:00:10.620000
Name: Time, dtype: object
Event: 200
9191    00:00:21.340000
9192    00:00:21.560000
9193    00:00:21.620000
Name: Time, dtype: object
Event: 400
1000    00:00:47.600000
1001    00:00:47.990000
1002    00:00:48.160000
Name: Time, dtype: object
Event: 800
0    00:01:40.910000
1    00:01:41.010000
2    00:01:41.090000
Name: Time, dtype: object
Event: 1500
11230    00:03:50.070000
11231    00:03:50.460000
11232    00:03:50.980000
Name: Time, dtype: object
Event: 5000
12232    00:12:37.350000
12233    00:12:39.360000
12234    00:12:39.740000
Name: Time, dtype: object
Event: 10000
2007    00:26:17.530000
2008    00:26:20.310000
2009    00:26:22.750000
Name: Time, dtype: object

For the road running events, the time format is the same as already defined in time_format_long above. For the track events, most of the times have the same format ('%H:%M:%S.%f'), but there are occassional cases where the milliseconds field is missing. For these cases it's possible to use the infer_datetime_format feature of pandas.to_datetime().

In [94]:
# Convert strings in the Time column to datetime objects for the road running events.
# Convert to datetime and extract the time part only.
events = top_running['Road'].unique().tolist()
events.remove(0)
for event in events:
    top_running.loc[top_running['Road'] == event, 'Time'] = pd.to_datetime(
        top_running[top_running['Road'] == event]['Time'], format=time_format_long).apply(datetime.time)
In [95]:
# Convert strings in the Time column to datetime objects for the track running events.
# Convert to datetime and extract the time part only.
events = top_running['Track_Flat'].unique().tolist()
events.remove(0)
for event in events:
    top_running.loc[top_running['Track_Flat'] == event, 'Time'] = pd.to_datetime(
        top_running[top_running['Track_Flat'] == event]['Time'], infer_datetime_format=True).apply(datetime.time)
Name

Looking at the names in this data set - they seem straightforward:

In [96]:
top_running['Name'].head()
Out[96]:
0      David Rudisha
1      David Rudisha
2      David Rudisha
3    Wilson Kipketer
4    Wilson Kipketer
Name: Name, dtype: object

Other name text processing is the same as the previous section

In [97]:
# Use the processing function defined previously
process_names(top_running)
Gender

To be consistent with the other data sets, change the possible values of the 'Gender' feature to be either 'M' or 'F' instead of 'Men' or 'Women'.

In [98]:
top_running['Gender'] = ['M' if top_running['Gender'].iloc[i]=='Men' else 'F' for i in top_running.index]
In [99]:
top_running.head()
Out[99]:
Rank Time Name RawName Country Date of Birth Place City Year Date Gender Road Track_Flat
0 1 00:01:40.910000 david rudisha David Rudisha KEN 1988-12-17 1.0 London 2012 2012-09-08 M 0 800
1 2 00:01:41.010000 david rudisha David Rudisha KEN 1988-12-17 1.0 Rieti 2010 2010-08-29 M 0 800
2 3 00:01:41.090000 david rudisha David Rudisha KEN 1988-12-17 1.0 Berlin 2010 2010-08-22 M 0 800
3 4 00:01:41.110000 wilson kipketer Wilson Kipketer DEN 1970-12-12 1.0 Köln 1997 1997-08-24 M 0 800
4 5 00:01:41.240000 wilson kipketer Wilson Kipketer DEN 1970-12-12 1.0 Zürich 1997 1997-08-13 M 0 800
Olympic Label

Later in this analysis, times from this data set will be merged into the Olympic data set. To facilitate this, it is necessary to label which rows correspond to an Olympic Games. This will be done by comparing the 'Date' field of the result to the known dates of the Olympic Games.

In [100]:
# Convert dates to datetime format
top_running['Date'] = pd.to_datetime(top_running['Date'], infer_datetime_format=True)
In [101]:
# What's the earliest year in the top_running data set?
min(top_running['Year'].tolist())
Out[101]:
1962

So there is no need to look at years before 1962.

In [102]:
# List of dates of Olympic summer games
# Source: https://en.wikipedia.org/wiki/Summer_Olympic_Games
# Use format Year-month-day
olympic_dates = [
                ['1964-10-10', '1964-10-24'],
                ['1968-10-12', '1968-10-27'],
                ['1972-08-26', '1972-09-10'],
                ['1976-07-17', '1976-08-01'],
                ['1980-07-19', '1980-08-03'],
                ['1984-07-28', '1984-08-12'],
                ['1988-09-17', '1988-10-02'],
                ['1992-07-25', '1992-08-09'],
                ['1996-07-19', '1996-08-04'],
                ['2000-09-15', '2000-10-01'],
                ['2004-08-13', '2004-08-29'],
                ['2008-08-08', '2008-08-24'],
                ['2012-07-27', '2012-08-12'],
                ['2016-08-05', '2016-08-21']
                ]

olympic_dates_df = pd.DataFrame(olympic_dates, columns=['Start', 'End'], 
                                index=[1964,
                                       1968,
                                       1972,
                                       1976,
                                       1980,
                                       1984,
                                       1988,
                                       1992,
                                       1996, 
                                       2000, 
                                       2004, 
                                       2008, 
                                       2012, 
                                       2016])
olympic_dates_df['Start'] = pd.to_datetime(olympic_dates_df['Start'], format='%Y-%m-%d')
olympic_dates_df['End'] = pd.to_datetime(olympic_dates_df['End'], format='%Y-%m-%d')
In [103]:
top_running.insert(loc=top_running.columns.get_loc('Date'), column='Olympics', value=False)
In [104]:
for y in olympic_dates_df.index:
    top_running.loc[top_running['Year'] == y, 'Olympics'] = (top_running['Year'] == y) & (top_running['Date'] >= olympic_dates_df.loc[y, 'Start']) & (top_running['Date'] <= olympic_dates_df.loc[y, 'End'])
In [105]:
top_running[top_running['Olympics']== True].head()
Out[105]:
Rank Time Name RawName Country Date of Birth Place City Year Olympics Date Gender Road Track_Flat
17 18 00:01:42.150000 david rudisha David Rudisha KEN 1988-12-17 1.0 Rio de Janeiro 2016 True 2016-08-15 M 0 800
41 42 00:01:42.580000 vebjørn rodal Vebjørn Rodal NOR 1972-09-16 1.0 Atlanta 1996 True 1996-07-31 M 0 800
47 45 00:01:42.610000 taoufik makhloufi Taoufik Makhloufi ALG 1988-04-29 2.0 Rio de Janeiro 2016 True 2016-08-15 M 0 800
55 56 00:01:42.740000 hezekiél sepeng Hezekiél Sepeng RSA 1974-06-30 2.0 Atlanta 1996 True 1996-07-31 M 0 800
62 63 00:01:42.790000 frederick onyancha Frederick Onyancha KEN 1969-12-25 3.0 Atlanta 1996 True 1996-07-31 M 0 800
Top 10 Label

It is useful to label the top 10 performances in each event, for each gender, and for every year. This is because the data set includes the top 1000 performances for all events, and since the main concern for this analysis is the factors affecting the improvement of performances, it is worth identifying the top 10 performances in each year.

In [106]:
top_running.insert(loc=top_running.columns.get_loc('Time'), column='Top 10', value=False)

event_categories = ['Track_Flat', 'Road']

for gender in top_running['Gender'].unique().tolist():
    for category in event_categories:
        events = top_running[category].unique().tolist()
        events.remove(0)
        events.sort()
        for event in events:
            for year in top_running[top_running[category] == event]['Year'].unique().tolist():
                # print("Category {}, Event {}, Year {}, Gender {}".format(category, event, year, gender)) #debug
                best_times_per_year = top_running[(top_running[category] == event) & 
                                         (top_running['Year'] == year) &
                                         (top_running['Gender'] == gender)]['Time'].tolist()
                if len(best_times_per_year) > 0:
                    if len(best_times_per_year) >= 10:
                        cutoff = sorted(best_times_per_year)[9] 
                        #print(cutoff) # For debugging
                    else:
                        cutoff = sorted(best_times_per_year)[len(best_times_per_year)-1]
                        #print(sorted(best_times_per_year)[len(best_times_per_year)-1]) # For debugging
                    top_running.loc[(top_running[category] == event) & 
                                    (top_running['Year'] == year) & (top_running['Gender'] == gender),
                                    'Top 10'] = top_running.loc[(top_running[category] == event) & 
                                    (top_running['Year'] == year) &
                                    (top_running['Gender'] == gender)]['Time'] < cutoff
                else:
                    continue

Perform a check that this has worked:

In [107]:
top_running[(top_running['Road'] == 42195) & (top_running['Top 10'] == True) & (top_running['Year'] == 2012)].head()
Out[107]:
Rank Top 10 Time Name RawName Country Date of Birth Place City Year Olympics Date Gender Road Track_Flat
3020 20 True 02:04:15 geoffrey mutai Geoffrey Mutai KEN 1981-10-07 1.0 Berlin 2012 False 2012-09-30 M 42195 0
3021 21 True 02:04:16 dennis kimetto Dennis Kimetto KEN 1984-04-22 2.0 Berlin 2012 False 2012-09-30 M 42195 0
3022 22 True 02:04:23 ayele absehero Ayele Absehero ETH 1990-12-28 1.0 Dubai 2012 False 2012-01-27 M 42195 0
3032 32 True 02:04:38 tsegay kebede Tsegay Kebede ETH 1987-01-15 1.0 Chicago 2012 False 2012-10-07 M 42195 0
3035 35 True 02:04:44 wilson kipsang kiprotich Wilson Kipsang Kiprotich KEN 1982-03-15 1.0 London 2012 False 2012-04-22 M 42195 0

This concludes the processing of this data set, and the data frame will be named top_running from this point on.

3.4. Merging Data Sets

The full Olympic data set has information about athlete characteristics but no times or results. Both the track and field data set and the top running times data set have times and results, but no athlete data. So to answer questions about how results and athlete characteristics are related it is necessary to merge these data sets. Athletes often compete in multiple Olympic Games and in different events, so it will be necessary to find a match based on the year, the event, medal awardd and the athlete's name. It will be straightforward to match the year across both data sets, and also the events and medals, because the labels are already standardised. The name presents an additional challenge because it is written differently in each data set for some athletes appearing in both. For example, here is how Mo Farah's performance in the 10000 m in 2016 looks in the Olympic track and field data set:

In [108]:
ol_tf.loc[[1]]
Out[108]:
Gender Event Location Year Medal Name Nationality Result
1 M 10000M Men Rio 2016 G Mohamed FARAH USA 25:05.17

Compare the way his name is written to the way it appears for the same performance in the full Olympic data set:

In [109]:
ol_running.loc[[66487]]
Out[109]:
ID Name RawName Gender Age Height Weight Team NOC Games Year Season City Sport Hurdles Road Steeplechase Track_Flat Medal
66487 34012 mohamed muktar jama farah Mohamed Muktar Jama "Mo" Farah M 33.0 175.0 58.0 Great Britain GBR 2016 Summer 2016 Summer Rio de Janeiro Athletics 0 0 0 10000 G

Row 66487 contains Mo Farah's performance matching the one in the Olympic track and field data, but the name is written very differently. To overcome this, a method called fuzzy matching will be used.

The aim is to merge the time data into the ol_running data frame, where it is available.

In [110]:
ol_running.head()
Out[110]:
ID Name RawName Gender Age Height Weight Team NOC Games Year Season City Sport Hurdles Road Steeplechase Track_Flat Medal
26 8 cornelia aalten Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN Netherlands NED 1932 Summer 1932 Summer Los Angeles Athletics 0 0 0 100 NaN
98 34 jamale aarrass Jamale (Djamel-) Aarrass (Ahrass-) M 30.0 187.0 76.0 France FRA 2012 Summer 2012 Summer London Athletics 0 0 0 1500 NaN
148 55 antonio abadia beci Antonio Abadia Beci M 26.0 170.0 65.0 Spain ESP 2016 Summer 2016 Summer Rio de Janeiro Athletics 0 0 0 5000 NaN
190 86 jos manuel abascal gmez Jos Manuel Abascal Gmez M 22.0 182.0 67.0 Spain ESP 1980 Summer 1980 Summer Moskva Athletics 0 0 0 1500 NaN
191 86 jos manuel abascal gmez Jos Manuel Abascal Gmez M 26.0 182.0 67.0 Spain ESP 1984 Summer 1984 Summer Los Angeles Athletics 0 0 0 1500 B

Add two columns to ol_running, one for the time and one for the merged-in name, which can be used as a sanity check for the data merging process.

In [111]:
ol_running.insert(loc=len(ol_running.columns), column='Time', value=pd.NaT)
ol_running.insert(loc=ol_running.columns.get_loc('RawName'), column='Merged_name', value=np.NaN)
ol_running.insert(loc=ol_running.columns.get_loc('RawName'), column='Ratio', value=np.NaN)

Now define a function to merge the times from the ol_tf_running data set into the ol_running data set. This function splits the results in each data set into groups by event, year, gender and medal awarded. This cuts the full results set into much smaller and more manageable groups. Every pass of the loop examines a pair of corresponding groups, one from each data set. Each group of results is for the same set of event, year, gender and medal awarded. The function then compares the names in each. If the strings don't match in a simple way (using str.find(), then it applies the fuzzy matching algorithm to find the best match (process.extractOne()). If the match ratio between the two strings being compared is above a threshold (chosen arbitrarily as 50) then use the two rows being compared as a match, and save the time, name and ratio in the ol_running dataset.

In [112]:
def merge_times(df, event_categories, debug=False):
    """
    Helper function to merge times from one dataset into the ol_running dataset.
    Event, year, gender, medal and athlete name are used as inputs to match athlete data from
    one data frame to the same athlete's performance in the other data frame.
    Names are matched using fuzzy string matching.
    
    Input parameters:
    df               - data frame to merge
    event_categories - list of categories of events
    debug            - True/False flag to indicate whether to print out debugging information.
    
    Returns:
    None
    """
    for category in event_categories:
        if debug:
            print(category)
        ol_running_groups = ol_running.groupby([category, 'Year', 'Gender', 'Medal'])
        df_groups = df.groupby([category, 'Year', 'Gender', 'Medal'])
        events = ol_running[category].unique().tolist()
        events.remove(0)
        for event in events:
            if debug:
                print(event)
            for gender in ol_running['Gender'].unique().tolist():
                if debug:
                    print(gender)
                for year in ol_running['Year'].unique().tolist():
                    if debug:
                        print(year)
                    for medal in ol_running['Medal'].unique().tolist():
                        if debug:
                            print(medal)
                        try:
                            group_1 = ol_running_groups.get_group((event, year, gender, medal))
                        except KeyError:
                            if debug:
                                print("No results for this combination in ol_running_groups")
                            continue
                        try:
                            group_2 = df_groups.get_group((event, year, gender, medal))
                        except KeyError:
                            if debug:
                                print("No results for this combination in df_groups")
                            continue
                        name_options = group_1['Name'].tolist()
                        for name in group_2['Name']:
                            find_result = group_1['Name'].str.find(name)
                            i = find_result[find_result>-1].index
                            if debug:
                                print(i)
                            if(i.any()):
                                if debug:
                                    print("str.find found a match: {}".format(name))
                                # Don't replace a time if one has already been found for this row
                                if pd.isnull(ol_running.loc[i]['Time'].tolist()):
                                    ol_running.loc[i, 'Time'] = group_2.loc[group_2['Name']==name]['Time'].tolist()
                            else:
                                if debug:
                                    print("str.find did NOT find a match:")   
                                best_match = process.extractOne(name, name_options)
                                if debug:
                                    print(best_match)
                                    print("Best name: {}".format(best_match[0]))
                                    print("Match confidence: {}".format(best_match[1]))
                                    print("index={}".format(group_1[group_1['Name']==best_match[0]].index))
                                if best_match[1] > 50:
                                    i=group_1[group_1['Name']==best_match[0]].index
                                    # Don't replace a time if one has already been found for this row
                                    if pd.isnull(ol_running.loc[i]['Time'].tolist()):
                                        ol_running.loc[i, 'Merged_name'] = name
                                        ol_running.loc[i, 'Ratio'] = best_match[1]
                                        ol_running.loc[i, 'Time'] = group_2.loc[group_2['Name']==name]['Time'].tolist()
In [113]:
event_categories = ['Track_Flat', 'Hurdles', 'Road', 'Steeplechase']
merge_times(ol_tf_running, event_categories)

Check how well the fuzzy matching algorithm is doing:

In [114]:
ol_running.loc[~ol_running['Time'].isnull()].head()
Out[114]:
ID Name Merged_name Ratio RawName Gender Age Height Weight Team ... Year Season City Sport Hurdles Road Steeplechase Track_Flat Medal Time
191 86 jos manuel abascal gmez josé manuel abascal 95.0 Jos Manuel Abascal Gmez M 26.0 182.0 67.0 Spain ... 1984 Summer Los Angeles Athletics 0 0 0 1500 B 00:03:34.300000
655 379 addis abebe NaN NaN Addis Abebe M 21.0 160.0 50.0 Ethiopia ... 1992 Summer Barcelona Athletics 0 0 0 10000 B 00:28:00.070000
720 411 gezahgne abera gezahegne abera 97.0 Gezahgne Abera M 22.0 166.0 58.0 Ethiopia ... 2000 Summer Sydney Athletics 0 42195 0 0 G 02:10:11
747 428 elvan abeylegesse NaN NaN Elvan Abeylegesse F 25.0 159.0 40.0 Turkey ... 2008 Summer Beijing Athletics 0 0 0 5000 S 00:15:42.740000
915 519 harold maurice abrahams harold abrahams 86.0 Harold Maurice Abrahams M 24.0 183.0 75.0 Great Britain ... 1924 Summer Paris Athletics 0 0 0 100 G 00:00:10.600000

5 rows × 22 columns

From a visual scan of the three columns corresponding to athlete name, it looks like the fuzzy matching algorithm is doing a good job of finding the correct names. The matching algorithm uses a threshold value of 50 for the match ratio. As a further check, examine the matches with the lowest match ratio:

In [115]:
ol_running[ol_running['Ratio'] < 70]
Out[115]:
ID Name Merged_name Ratio RawName Gender Age Height Weight Team ... Year Season City Sport Hurdles Road Steeplechase Track_Flat Medal Time
83230 42280 lorraine graham lorraine fenton 66.0 Lorraine Graham (-Fenton) F 27.0 174.0 59.0 Jamaica ... 2000 Summer Sydney Athletics 0 0 0 400 S 00:00:49.580000
126564 63825 katharina anna krau käthe krauss 52.0 Katharina Anna "Kthe" Krau F 29.0 176.0 72.0 Germany ... 1936 Summer Berlin Athletics 0 0 0 100 B 00:00:11.900000
169845 85373 yuliya viktorovna nesterenko yuliya nestsiarenka 65.0 Yuliya Viktorovna Nesterenko (Bartsevich-) F 25.0 176.0 62.0 Belarus ... 2004 Summer Athina Athletics 0 0 0 100 G 00:00:10.930000
209437 105164 flix snchez marcelo felix sanchez 69.0 Flix Snchez Marcelo M 26.0 178.0 64.0 Dominican Republic ... 2004 Summer Athina Athletics 400 0 0 0 G 00:00:47.630000
209439 105164 flix snchez marcelo felix sanchez 69.0 Flix Snchez Marcelo M 34.0 178.0 64.0 Dominican Republic ... 2012 Summer London Athletics 400 0 0 0 G 00:00:47.630000
225056 113067 son gi-jeong kitei son 54.0 Son Gi-Jeong M 21.0 170.0 60.0 Japan ... 1936 Summer Berlin Athletics 0 42195 0 0 G 02:29:19
260600 130418 erik wilhelm wiln erkka wilen 66.0 Erik Wilhelm "Erkka" Wiln M 25.0 180.0 73.0 Finland ... 1924 Summer Paris Athletics 400 0 0 0 S 00:00:53.800000

7 rows × 22 columns

So there are only seven values with a match ratio below 70, and they all look correct. So the matching algorithm seems to be working well.

In [116]:
ol_running.loc[~ol_running['Time'].isnull()].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1177 entries, 191 to 270000
Data columns (total 22 columns):
ID              1177 non-null int64
Name            1177 non-null object
Merged_name     815 non-null object
Ratio           815 non-null float64
RawName         1177 non-null object
Gender          1177 non-null object
Age             1171 non-null float64
Height          1123 non-null float64
Weight          1121 non-null float64
Team            1177 non-null object
NOC             1177 non-null object
Games           1177 non-null object
Year            1177 non-null int16
Season          1177 non-null object
City            1177 non-null object
Sport           1177 non-null object
Hurdles         1177 non-null int16
Road            1177 non-null int32
Steeplechase    1177 non-null int16
Track_Flat      1177 non-null int16
Medal           1177 non-null object
Time            1177 non-null object
dtypes: float64(4), int16(4), int32(1), int64(1), object(12)
memory usage: 179.3+ KB

So this method has merged in 1177 time data fields. Next, merge in times from the top_running data frame. This is a little more complicated because it is necessary to group on events marked as True in the 'Olympics' feature of this data frame to screen out other performances by the same athlete in the same year. In addition, some athletes may run several heats and a final in a single Games. Therefore, it is necessary to use the time from the race with the latest date, and within the period of the Games in question. If there turn out to be more than one (e.g., if a final and a heat were run on the same day) then we choose one arbitrarily. This is not a huge problem, since we are attempting to relate performances to height, weight and age, and those factors will not change within one day anyway.

The ol_tf_running data set contained only medal-winning performances, so it was (almost) guaranteed that there would be a corresponding row in the ol_running data set. The top_running data set differs from the ol_tf_running data set in that it contains many non-medal winning performances. Therefore, it's not possible to use the 'Medal' field to group the performances and use that to help match them. This means there is a larger scope for false positives, where the fuzzy matching algorithm wrongly identifies two similar names as a match. To help solve this, the match ratio threshold is raised from 50 to 80 in this function. It's not straightforward to combine this extra complexity into the existing merge_times function, so write a new function to handle this.

In [117]:
def merge_times_ext(df, event_categories, debug=False):
    """
    Helper function to merge times from one dataset into the ol_running dataset.
    Event, year, gender, and athlete name are used as inputs to match athlete data from
    one data frame to the same athlete's performance in the other data frame.
    Names are matched using fuzzy string matching.
    
    Input parameters:
    df               - data frame to merge
    event_categories - list of categories of events
    debug            - True/False flag to indicate whether to print out debugging information.
    
    Returns:
    None
    """
    for category in event_categories:
        if debug:
            print(category)
        ol_running_groups = ol_running.groupby([category, 'Year', 'Gender'])
        df_groups = df.groupby([category, 'Year', 'Gender', 'Olympics'])
        events = ol_running[category].unique().tolist()
        events.remove(0)
        for event in events:
            if debug:
                print(event)
            for gender in ol_running['Gender'].unique().tolist():
                if debug:
                    print(gender)
                for year in ol_running['Year'].unique().tolist():
                    if debug:
                        print(year)
                    try:
                        group_1 = ol_running_groups.get_group((event, year, gender))
                    except KeyError:
                        if debug:
                            print("No results for this combination in ol_running_groups")
                        continue
                    try:
                        group_2 = df_groups.get_group((event, year, gender, True))
                    except KeyError:
                        if debug:
                            print("No results for this combination in df_groups")
                        continue
                    name_options = group_1['Name'].tolist()
                    for name in group_2['Name']:
                        find_result = group_1['Name'].str.find(name)
                        i = find_result[find_result>-1].index
                        if debug:
                            print(i)
                        if(i.any()):
                            if debug:
                                print("str.find found a match: {}".format(name))
                            # Don't replace a time if one has already been found for this row
                            if pd.isnull(ol_running.loc[i]['Time'].tolist()):
                                latest_race_date = group_2.loc[(group_2['Name']==name, 'Date')].max()
                                ol_running.loc[i, 'Time'] = group_2.loc[
                                    (group_2['Name'] == name) & 
                                    (group_2['Date'] == latest_race_date)]['Time'].tolist()[0]
                        else:
                            if debug:
                                print("str.find did NOT find a match:")   
                            best_match = process.extractOne(name, name_options)
                            if debug:
                                print(best_match)
                                print("Best name: {}".format(best_match[0]))
                                print("Match confidence: {}".format(best_match[1]))
                                print("index={}".format(group_1[group_1['Name']==best_match[0]].index))
                            if best_match[1] > 80:
                                i=group_1[group_1['Name']==best_match[0]].index
                                # Don't replace a time if one has already been found for this row
                                if pd.isnull(ol_running.loc[i]['Time'].tolist()):
                                    ol_running.loc[i, 'Merged_name'] = name
                                    ol_running.loc[i, 'Ratio'] = best_match[1]
                                    latest_race_date = group_2.loc[(group_2['Name']==name, 'Date')].max()
                                    ol_running.loc[i, 'Time'] = group_2.loc[
                                        (group_2['Name'] == name) & 
                                        (group_2['Date'] == latest_race_date)]['Time'].tolist()[0]
In [118]:
event_categories = ['Track_Flat', 'Road']
merge_times_ext(top_running, event_categories)
In [119]:
ol_running.loc[~ol_running['Time'].isnull()].head()
Out[119]:
ID Name Merged_name Ratio RawName Gender Age Height Weight Team ... Year Season City Sport Hurdles Road Steeplechase Track_Flat Medal Time
191 86 jos manuel abascal gmez josé manuel abascal 95.0 Jos Manuel Abascal Gmez M 26.0 182.0 67.0 Spain ... 1984 Summer Los Angeles Athletics 0 0 0 1500 B 00:03:34.300000
559 321 ahmad hassan abdullah abdullah ahmad hassan 95.0 Ahmad Hassan Abdullah M 27.0 164.0 55.0 Qatar ... 2008 Summer Beijing Athletics 0 0 0 10000 NaN 00:27:23.750000
655 379 addis abebe NaN NaN Addis Abebe M 21.0 160.0 50.0 Ethiopia ... 1992 Summer Barcelona Athletics 0 0 0 10000 B 00:28:00.070000
720 411 gezahgne abera gezahegne abera 97.0 Gezahgne Abera M 22.0 166.0 58.0 Ethiopia ... 2000 Summer Sydney Athletics 0 42195 0 0 G 02:10:11
745 428 elvan abeylegesse NaN NaN Elvan Abeylegesse F 21.0 159.0 40.0 Turkey ... 2004 Summer Athina Athletics 0 0 0 1500 NaN 00:04:00.670000

5 rows × 22 columns

In [120]:
ol_running.loc[~ol_running['Time'].isnull()].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1547 entries, 191 to 270808
Data columns (total 22 columns):
ID              1547 non-null int64
Name            1547 non-null object
Merged_name     1059 non-null object
Ratio           1059 non-null float64
RawName         1547 non-null object
Gender          1547 non-null object
Age             1541 non-null float64
Height          1489 non-null float64
Weight          1488 non-null float64
Team            1547 non-null object
NOC             1547 non-null object
Games           1547 non-null object
Year            1547 non-null int16
Season          1547 non-null object
City            1547 non-null object
Sport           1547 non-null object
Hurdles         1547 non-null int16
Road            1547 non-null int32
Steeplechase    1547 non-null int16
Track_Flat      1547 non-null int16
Medal           1208 non-null object
Time            1547 non-null object
dtypes: float64(4), int16(4), int32(1), int64(1), object(12)
memory usage: 235.7+ KB

Merging the second data set in has increased the number of rows with a time by a few hundred.

4. Data Analysis and Visualisation

4.1. Question 1: How have athletes' performances changed through history?

To investigate this plot athletes' results (i.e. times) against the date of the performance. This will be done individually for each event and separately for each gender. The plots use colour to identify Olympic medal winning performances (see the key). Both the ol_tf_running and top_running datasets are used for this. The analysis will follow in section 5.1.

In [121]:
# Group by gender
top_running_gender_groups = top_running.groupby(['Gender', 'Top 10'])
top_running_m = top_running_gender_groups.get_group(('M', True))
top_running_f = top_running_gender_groups.get_group(('F', True))
ol_tf_running_gender_groups = ol_tf_running.groupby('Gender')
ol_tf_running_m = ol_tf_running_gender_groups.get_group('M')
ol_tf_running_f = ol_tf_running_gender_groups.get_group('F')
In [122]:
def build_graph_labels(gender, category, event, characteristic=None):
    """
    build_graph_labels
    
    Helper function to create strings to use in constructing the graph title
    
    Input parameters:
    gender - athlete gender group
    category - type of event
    event - specific distance
    characteristic - athlete characteristic, default None
    
    Returns:
    gender_label - Readable gender string
    event_label - Readable event name string
    category_label - Readable event category string
    unit - Unit for the characteristic to plot
    
    """
    if gender=='M':
        gender_label = "Male"
    else:
        gender_label = "Female"
    if category == 'Road':
        if event == 42195:
            event_label = 'Marathon'
        if event == 21098:
            event_label = 'Half Marathon'
        category_label = 'Road Running'
    if category == 'Track_Flat':
        event_label = str(event)+'m'
        category_label = 'Track (Flat)'
    if category == 'Hurdles':
        event_label = str(event)+'m'+' Hurdles'
        category_label = 'Hurdles'
    if category == 'Steeplechase':
        event_label = str(event)+'m'+' Steeplechase'
        category_label = 'Steeplechase'
    if characteristic == 'Height':
        unit='cm'
    elif characteristic == 'Weight':
        unit='kg'
    elif characteristic == 'Age':
        unit='years'
    elif characteristic == 'BMI':
        unit='m/kg*kg'
    else:
        unit=None
        
    return gender_label, event_label, category_label, unit
In [123]:
def plot_times(event_categories, debug=False):
    """
    Helper function to plot finish times for athletes across all events.
    
    Input parameters:
    event_categories - list of categories of events
    debug            - True/False flag to indicate whether to print out debugging information.
    
    Returns:
    None
    """
    global graph_number
    top_running_data_present = True
    ol_tf_running_data_present = True
    for category in event_categories:
        events = ol_running[category].unique().tolist()
        events.remove(0)
        events.sort()
        if category == 'Road':
            events.append(21098)
        for event in events:
            if debug:
                print("Category = {}, Event={}".format(category, event))
            for gender in ol_tf_running_gender_groups.groups.keys():
                if gender == 'M':
                    top_running_group = top_running_m
                    ol_tf_running_group = ol_tf_running_m
                else:
                    top_running_group = top_running_f
                    ol_tf_running_group = ol_tf_running_f
                plt.figure(figsize=(18, 9))
                # Plot top running time data, if it exists
                try:
                    plt.scatter(top_running_group[top_running_group[category] == event]['Date'], 
                            list(top_running_group[top_running_group[category] == event]['Time']), 
                                color='b', label='Top running times')
                except KeyError:
                    if debug:
                        print("No data from top running times for this event.")
                    top_running_data_present = False
                # Plot each olympic medal colour, if data exists for this event
                if ol_tf_running_group[(ol_tf_running_group[category] == event)].shape[0] != 0:
                    plt.scatter(pd.to_datetime(ol_tf_running_group[(ol_tf_running_group[category] == event) & 
                                                               (ol_tf_running_group['Medal'] == 'G')]['Year'], format='%Y'), 
                                list(ol_tf_running_group[(ol_tf_running_group[category] == event) & 
                                                     (ol_tf_running_group['Medal'] == 'G')]['Time']), 
                                color='gold', label='Olympic gold medal')
                    plt.scatter(pd.to_datetime(ol_tf_running_group[(ol_tf_running_group[category] == event) & 
                                                               (ol_tf_running_group['Medal'] == 'S')]['Year'], format='%Y'), 
                                list(ol_tf_running_group[(ol_tf_running_group[category] == event) & 
                                                     (ol_tf_running_group['Medal'] == 'S')]['Time']), 
                                color='silver', label='Olympic silver medal')
                    plt.scatter(pd.to_datetime(ol_tf_running_group[(ol_tf_running_group[category] == event) & 
                                                               (ol_tf_running_group['Medal'] == 'B')]['Year'], format='%Y'), 
                                list(ol_tf_running_group[(ol_tf_running_group[category] == event) & 
                                                     (ol_tf_running_group['Medal'] == 'B')]['Time']), 
                                color='brown', label='Olympic bronze medal')
                else:
                    ol_tf_running_data_present = False
                    if debug:
                        print("No data from Olympic data set for this event.")
                # Only plot if there is some data
                if  (ol_tf_running_data_present == True) or (top_running_data_present == True):
                    # Construct the graph title and axis labels
                    gender_label, event_label, category_label, unit = build_graph_labels(
                        gender, category, event)
                    plt.xlabel('Year')
                    plt.ylabel('Time')
                    plt.title("Graph {0}: Variation in {1} {2} Times".format(graph_number, gender_label, event_label))
                    plt.legend()
                    #if ol_tf_running_group[(ol_tf_running_group[category] == event)].shape[0] != 0:
                    plt.show()
                    graph_number+=1
                top_running_data_present = True
                ol_tf_running_data_present = True
In [124]:
# A label for the graphs plotted
graph_number = 1

# Plot graphs for all events
event_categories = ['Track_Flat', 'Steeplechase', 'Hurdles', 'Road']
plot_times(event_categories)
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>

The analysis for the question "How have athletes' performances changed through history?" can be found in section 5.1.

4.2. Question 2: How Have Athletes' Characteristics Changed?

This will be examined by plotting each of the four athlete characteristics (height, weight, age, BMI) against year of competition. Multiple events are plotted on the same axis for ease of comparison. The Numpy polyfit() method is used to plot a best fit line for each event.

The analysis can be found in section 5.2

In [125]:
def plot_athlete_characteristics(event_categories, characteristics, debug=False):
    """
    Helper function to plot athlete characteristics (height, weight, age, BMI) against year of competition.
    Multiple events are plotted on the same axis for comparison.
    A best fit line is added for each event in the plot.
    
    Input parameters:
    event_categories - list of categories of events
    characteristics  - list of athlete characteristics to plot
    debug            - True/False flag to indicate whether to print out debugging information.
    
    Returns:
    None
    """
    meanvals = []
    years = []
    global graph_number
    for c in characteristics:
        for category in event_categories:
            if debug:
                print(category)
            ol_running_groups = ol_running.groupby([category, 'Gender'])
            events = ol_running[category].unique().tolist()
            events.remove(0)
            for gender in ol_running['Gender'].unique().tolist():
                plt.figure(figsize=(18, 9))
                if debug:
                    print(gender)
                for event in events:
                    if debug:
                        print(event)
                    try:
                        ol_running_group = ol_running_groups.get_group((event, gender))
                    except KeyError:
                        if debug:
                            print("No results for this combination in ol_running_groups")
                        continue
                    for year in ol_running_group['Year'].unique().tolist():
                        meanvals.append(ol_running_group[ol_running_group['Year'] == year][c].mean())
                        years.append(year)
                    plt.scatter(years, meanvals, label=str(event)+'m')
                    # Remove NaN values - these will break the fit used  by polyfit() below
                    nullvals = np.isnan(meanvals)
                    for i in np.where(nullvals)[0]:
                        meanvals.pop(i)
                        years.pop(i)
                    z = np.polyfit(years, meanvals, 1)
                    p = np.poly1d(z)
                    plb.plot(years, p(years))
                    meanvals = []
                    years = []
                #Construct the graph title and axis labels
                gender_label, event_label, category_label, unit = build_graph_labels(gender, category, event, c)
                plt.xlabel('Year')
                plt.ylabel(c+'({})'.format(unit))
                plt.title("Graph {0}: Variation in {1} of {2} {3} Olympic Athletes Through History".format(
                        graph_number, c, gender_label, category_label))
                plt.legend()
                plt.show()
                graph_number+=1

Calculate body mass index (BMI):

In [126]:
ol_running.insert(loc=ol_running.columns.get_loc('Weight'), column='BMI', value=0)
ol_running['BMI'] = ol_running['Weight'] / ((ol_running['Height'] / 100)**2)
ol_running.head()
Out[126]:
ID Name Merged_name Ratio RawName Gender Age Height BMI Weight ... Year Season City Sport Hurdles Road Steeplechase Track_Flat Medal Time
26 8 cornelia aalten NaN NaN Cornelia "Cor" Aalten (-Strannood) F 18.0 168.0 NaN NaN ... 1932 Summer Los Angeles Athletics 0 0 0 100 NaN None
98 34 jamale aarrass NaN NaN Jamale (Djamel-) Aarrass (Ahrass-) M 30.0 187.0 21.733535 76.0 ... 2012 Summer London Athletics 0 0 0 1500 NaN None
148 55 antonio abadia beci NaN NaN Antonio Abadia Beci M 26.0 170.0 22.491349 65.0 ... 2016 Summer Rio de Janeiro Athletics 0 0 0 5000 NaN None
190 86 jos manuel abascal gmez NaN NaN Jos Manuel Abascal Gmez M 22.0 182.0 20.227026 67.0 ... 1980 Summer Moskva Athletics 0 0 0 1500 NaN None
191 86 jos manuel abascal gmez josé manuel abascal 95.0 Jos Manuel Abascal Gmez M 26.0 182.0 20.227026 67.0 ... 1984 Summer Los Angeles Athletics 0 0 0 1500 B 00:03:34.300000

5 rows × 23 columns

In [127]:
characteristics = ['Height', 'Weight', 'Age', 'BMI']
plot_athlete_characteristics(event_categories, characteristics)

The analysis can be found in section 5.2

This will be examined by plotting each of the four athlete characteristics against their performance (i.e., time) for each event and gender. Colours are used to indicate medal-winning performances as indicated by the key.

The analysis can be found in section 5.3

In [128]:
ol_running_gender_groups = ol_running.groupby('Gender')
ol_running_m = ol_running_gender_groups.get_group('M')
ol_running_f = ol_running_gender_groups.get_group('F')
In [131]:
def plot_time_vs_characteristics(event_categories, characteristics, debug=False):
    """
    Helper function to plot athlete characteristics (height, weight, age, BMI) against time.
    
    Input parameters:
    event_categories - list of categories of events
    characteristics  - list of athlete characteristics to plot
    debug            - True/False flag to indicate whether to print out debugging information.
    
    Returns:
    None
    """
    global graph_number
    data_present = True
    for c in characteristics:
        for category in event_categories:
            events = ol_running[category].unique().tolist()
            events.remove(0)
            events.sort()
            for event in events:
                if debug:
                    print("Category = {}, Event={}".format(category, event))
                for gender in ol_running_gender_groups.groups.keys():
                    if gender == 'M':
                        ol_running_group = ol_running_m
                    else:
                        ol_running_group = ol_running_f
                    plt.figure(figsize=(18, 9))
                    # Plot each olympic medal colour, if data exists for this event
                    if ol_running_group[(ol_running_group[category] == event)].shape[0] != 0:
                        plt.scatter(ol_running_group[(ol_running_group[category] == event) & 
                                                     (ol_running_group['Medal'] == 'G') &
                                                     (ol_running_group['Time'].notna())][c], 
                                    list(ol_running_group[(ol_running_group[category] == event) & 
                                                          (ol_running_group['Medal'] == 'G') & 
                                                          (ol_running_group['Time'].notna())]['Time']), 
                                    color='gold', label='Olympic gold medal')
                        plt.scatter(ol_running_group[(ol_running_group[category] == event) & 
                                                                   (ol_running_group['Medal'] == 'S') & 
                                                          (ol_running_group['Time'].notna())][c], 
                                    list(ol_running_group[(ol_running_group[category] == event) & 
                                                         (ol_running_group['Medal'] == 'S') & 
                                                          (ol_running_group['Time'].notna())]['Time']), 
                                    color='silver', label='Olympic silver medal')
                        plt.scatter(ol_running_group[(ol_running_group[category] == event) & 
                                                                   (ol_running_group['Medal'] == 'B') & 
                                                          (ol_running_group['Time'].notna())][c], 
                                    list(ol_running_group[(ol_running_group[category] == event) & 
                                                         (ol_running_group['Medal'] == 'B') & 
                                                          (ol_running_group['Time'].notna())]['Time']), 
                                    color='brown', label='Olympic bronze medal')
                        plt.scatter(ol_running_group[(ol_running_group[category] == event) & 
                                                                   (ol_running_group['Medal'].isnull()) & 
                                                          (ol_running_group['Time'].notna())][c], 
                                    list(ol_running_group[(ol_running_group[category] == event) & 
                                                         (ol_running_group['Medal'].isnull()) & 
                                                          (ol_running_group['Time'].notna())]['Time']), 
                                    color='blue', label='No medal')
                    else:
                        data_present = False
                        if debug:
                            print("No data from Olympic data set for this event.")
                    # Only plot if there is some data
                    if  data_present == True:
                        # Construct the graph title and axis labels
                        gender_label, event_label, category_label, unit = build_graph_labels(
                            gender, category, event, c)
                        plt.xlabel(c+'({})'.format(unit))
                        plt.ylabel('Time')
                        plt.title("Graph {0}: Variation in {1} {2} Times with {3}".format(
                            graph_number, gender_label, event_label, c))
                        plt.legend()
                        plt.show()
                        graph_number+=1
                    data_present = True

Are the relationships the same if we split the data into 20 year groups?

In [132]:
# Create a set of year groups throughout Olympic history
year_groups = np.arange(1896, 2020, 20)
year_groups[-1] += 1 # To include 2016 Games
year_groups
Out[132]:
array([1896, 1916, 1936, 1956, 1976, 1996, 2017])
In [133]:
def plot_time_vs_characteristics_time_groups(event_categories, characteristics, debug=False):
    """
    Helper function to plot athlete characteristics (height, weight, age, BMI) against time.
    This function uses colour codes to show the 20-year time period into which a performaance falls.
    
    Input parameters:
    event_categories - list of categories of events
    characteristics  - list of athlete characteristics to plot
    debug            - True/False flag to indicate whether to print out debugging information.
    
    Returns:
    None
    """
    global graph_number
    data_present = True
    for c in characteristics:
        for category in event_categories:
            events = ol_running[category].unique().tolist()
            events.remove(0)
            events.sort()
            for event in events:
                if debug:
                    print("Category = {}, Event={}".format(category, event))
                for gender in ol_running_gender_groups.groups.keys():
                    if gender == 'M':
                        ol_running_group = ol_running_m
                    else:
                        ol_running_group = ol_running_f
                    plt.figure(figsize=(18, 9))
                    # Plot
                    if ol_running_group[(ol_running_group[category] == event)].shape[0] != 0:
                        for y in range(len(year_groups)-1):
                            plt.scatter(ol_running_group[(ol_running_group[category] == event) & 
                                                         (ol_running_group['Year'] >= year_groups[y]) &
                                                         (ol_running_group['Year'] < year_groups[y+1]) &
                                                         (ol_running_group['Time'].notna())][c], 
                                        list(ol_running_group[(ol_running_group[category] == event) & 
                                                              (ol_running_group['Year'] >= year_groups[y]) &
                                                              (ol_running_group['Year'] < year_groups[y+1]) & 
                                                              (ol_running_group['Time'].notna())]['Time']), 
                                                label='{0} to {1}'.format(year_groups[y], year_groups[y+1]-1))
                    else:
                        data_present = False
                        if debug:
                            print("No data from Olympic data set for this event.")
                    # Only plot if there is some data
                    if  data_present == True:
                        # Construct the graph title and axis labels
                        gender_label, event_label, category_label, unit = build_graph_labels(
                            gender, category, event, c)
                        plt.xlabel(c+'({})'.format(unit))
                        plt.ylabel('Time')
                        plt.title("Graph {0}: Variation in {1} {2} Times with {3}".format(
                            graph_number, gender_label, event_label, c))
                        plt.legend()
                        plt.show()
                        graph_number+=1
                    data_present = True
In [134]:
# Plot graphs for all events
event_categories = ['Track_Flat', 'Steeplechase', 'Hurdles', 'Road']
plot_time_vs_characteristics(event_categories, characteristics)
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
In [135]:
plot_time_vs_characteristics_time_groups(event_categories, characteristics)
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>
<Figure size 1296x648 with 0 Axes>

The analysis can be found in section 5.3

5. Evaluate Results

5.1. How have athletes' performances changed through history?

This is an analysis for the results in section 4.1

5.2. How have athletes' characteristics changed?

This is the analysis for the results in section 4.2.

This is the analysis for the results in section 4.3.

In [ ]:
# What do the visualisations and data tell us? Does this answer the question?
# Are any values difficult to predict?
# Are we over/under fitting?
# Can we re-select our data set to optimise the model?

6. Deployment

This analysis is published on GitHub (https://github.com/mattjezza/ds-proj1-t2-elite-athletics) and summarised in a post on Medium.